AITopics

2506.12286

Country: North America > United States > Indiana (0.28)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Yanuganti, Vasudha, Puri, Ishaan, Chhatre, Swapnil, Singh, Mantinder, Jallepalli, Ashok, Shrivastava, Hritvik, Sharma, Pradeep Kumar

Repository-Aware File Path Retrieval via Fine-Tuned LLMs

arXiv.org Artificial Intelligence Oct-13-2025

Modern codebases make it hard for developers and AI coding assistants to find the right source files when answering questions like "How does this feature work?" or "Where was the bug introduced?" Traditional code search (keyword- or IR-based) often misses semantic context and cross-file links, while large language models (LLMs) understand natural language but lack repository-specific detail. We present a method for file path retrieval that fine-tunes a strong LLM (Qwen3-8B) with QLoRA and Unsloth optimizations to predict relevant file paths directly from a natural language query. To build training data, we introduce six code-aware strategies that use abstract syntax tree (AST) structure and repository content to generate realistic question-answer pairs, where answers are sets of file paths. The strategies range from single-file prompts to hierarchical repository summaries, providing broad coverage. We fine-tune on Python projects including Flask, Click, Jinja, FastAPI, and PyTorch, and obtain high retrieval accuracy: up to 91% exact match and 93% recall on held-out queries, clearly beating single-strategy training. On a large codebase like PyTorch (about 4,000 Python files), the model reaches 59% recall, showing scalability. We analyze how multi-level code signals help the LLM reason over cross-file context and discuss dataset design, limits (for example, context length in very large repos), and future integration of retrieval with LLM-based code intelligence.
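The abstract reports exact match and recall over sets of file paths but does not spell out how they are computed. The following is a minimal sketch of one common reading, assuming exact match means the predicted set equals the expected set and recall is the fraction of expected paths that were predicted. The function names and the example paths are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Set, Tuple


def exact_match(predicted: Iterable[str], expected: Iterable[str]) -> bool:
    """True when the predicted set of paths equals the expected set."""
    return set(predicted) == set(expected)


def recall(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Fraction of expected paths that appear among the predictions."""
    expected_set: Set[str] = set(expected)
    if not expected_set:
        return 1.0  # assumption: an empty answer set counts as full recall
    return len(expected_set & set(predicted)) / len(expected_set)


def evaluate(pairs: List[Tuple[List[str], List[str]]]) -> Tuple[float, float]:
    """Average exact match and recall over (predicted, expected) pairs."""
    n = len(pairs)
    em = sum(exact_match(p, e) for p, e in pairs) / n
    rc = sum(recall(p, e) for p, e in pairs) / n
    return em, rc


# Hypothetical model outputs for two queries (paths are illustrative only).
examples = [
    (["src/app.py", "src/routing.py"], ["src/app.py", "src/routing.py"]),
    (["src/cli.py"], ["src/cli.py", "src/config.py"]),
]
em, rc = evaluate(examples)
print(em, rc)  # 0.5 0.75
```

Under this reading, recall can exceed exact match because a query counts toward recall even when only part of its answer set is found, which matches the abstract's 93% recall against 91% exact match.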

large language model, machine learning, natural language, (20 more...)

2510.0885

Genre:

Research Report (0.64)
Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Fehr, Fabio, Sivaprasad, Prabhu Teja, Franceschi, Luca, Zappella, Giovanni

CoRet: Improved Retriever for Code Editing

arXiv.org Artificial Intelligence Jun-2-2025

In this paper, we introduce CoRet, a dense retrieval model designed for code-editing tasks that integrates code semantics, repository structure, and call graph dependencies. The model focuses on retrieving relevant portions of a code repository based on natural language queries such as requests to implement new features or fix bugs. These retrieved code chunks can then be presented to a user or to a second code-editing model or agent. To train CoRet, we propose a loss function explicitly designed for repository-level retrieval. On SWE-bench and Long Code Arena's bug localisation datasets, we show that our model substantially improves retrieval recall by at least 15 percentage points over existing models, and ablate the design choices to show their importance in achieving these results.

large language model, machine learning, natural language, (18 more...)

2505.24715

Country: North America > United States > Minnesota (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)

arXiv.org Artificial Intelligence Feb-1-2025

OrcaLoca: An LLM Agent Framework for Software Issue Localization

Yu, Zhongming, Zhang, Hejia, Zhao, Yujie, Huang, Hanxian, Yao, Matrix, Ding, Ke, Zhao, Jishen

Recent developments in Large Language Model (LLM) agents are revolutionizing Autonomous Software Engineering (ASE), enabling automated coding, problem fixes, and feature improvements. However, localization -- precisely identifying software problems by navigating to relevant code sections -- remains a significant challenge. Current approaches often yield suboptimal results due to a lack of effective integration between LLM agents and precise code search mechanisms. This paper introduces OrcaLoca, an LLM agent framework that improves accuracy for software issue localization by integrating priority-based scheduling for LLM-guided action, action decomposition with relevance scoring, and distance-aware context pruning. Experimental results demonstrate that OrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match rate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an open-source framework by 6.33 percentage points through its patch generation integration.

artificial intelligence, large language model, natural language, (14 more...)

2502.0035

Country: North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial Intelligence Oct-14-2024

EasyRAG: Efficient Retrieval-Augmented Generation Framework for Automated Network Operations

Feng, Zhangchi, Kuang, Dongdong, Wang, Zhongyuan, Nie, Zhijie, Zheng, Yaowei, Zhang, Richong

This paper presents EasyRAG, a simple, lightweight, and efficient retrieval-augmented generation framework for automated network operations. Our framework has three advantages. The first is accurate question answering. We designed a straightforward RAG scheme based on (1) a specific data processing workflow, (2) dual-route sparse retrieval for coarse ranking, (3) an LLM reranker for reranking, and (4) LLM answer generation and optimization. This approach achieved first place in the GLM4 track in the preliminary round and second place in the GLM4 track in the semifinals. The second is simple deployment. Our method primarily consists of BM25 retrieval and BGE-reranker reranking; it requires no fine-tuning of any models, occupies minimal VRAM, is easy to deploy, and is highly scalable. We provide a flexible code library with various search and generation strategies, facilitating custom process implementation. The last one is efficient inference. We designed an efficient inference acceleration scheme for the entire coarse ranking, reranking, and generation process that significantly reduces the inference latency of RAG while maintaining a good level of accuracy; each acceleration scheme can be plugged into any component of the RAG process, consistently enhancing the efficiency of the RAG system. Our code and data are released at https://github.com/BUAADreamer/EasyRAG.
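The coarse-ranking-then-reranking flow described above can be illustrated with a small sketch. It is not EasyRAG's code: it implements a standard BM25 scorer (Lucene-style IDF, with k1 and b set to commonly used defaults) and accepts any reranking function. The `overlap_reranker` below is a hypothetical stand-in for a learned model such as BGE-reranker, and the documents are invented for illustration. The dual-route retrieval from the abstract is not reproduced.

```python
import math
from collections import Counter
from typing import Callable, List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query: str, docs: List[str],
             reranker: Callable[[str, str], float], coarse_k: int = 2) -> List[int]:
    """Coarse-rank with BM25, then reorder the top coarse_k hits with the reranker."""
    tokens = query.lower().split()
    tokenized = [d.lower().split() for d in docs]
    scores = bm25_scores(tokens, tokenized)
    coarse = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:coarse_k]
    return sorted(coarse, key=lambda i: reranker(query, docs[i]), reverse=True)


def overlap_reranker(query: str, doc: str) -> float:
    """Hypothetical placeholder: share of query tokens present in the document."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)


docs = [
    "restart the router to reset the network link",
    "the firewall blocks port 22 by default",
    "reset the router password from the admin page",
]
ranked = retrieve("reset router password", docs, overlap_reranker)
print(ranked)  # [2, 0]
```

Keeping the reranker as a plain callable means the cheap sparse stage narrows the candidate set before the more expensive model runs, which is the usual motivation for a two-stage design.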

large language model, machine learning, natural language, (20 more...)

2410.10315

Country:

Asia > Singapore (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Martin, Valeria, Venable, K. Brent, Morgan, Derek

Development and Application of a Sentinel-2 Satellite Imagery Dataset for Deep-Learning Driven Forest Wildfire Detection

arXiv.org Artificial Intelligence Sep-24-2024

Forest loss due to natural events, such as wildfires, represents an increasing global challenge that demands advanced analytical methods for effective detection and mitigation. To this end, the integration of satellite imagery with deep learning (DL) methods has become essential. Nevertheless, this approach requires substantial amounts of labeled data to produce accurate results. In this study, we use bi-temporal Sentinel-2 satellite imagery sourced from Google Earth Engine (GEE) to build the California Wildfire GeoImaging Dataset (CWGID), a high-resolution labeled satellite imagery dataset with over 100,000 labeled before-and-after forest wildfire image pairs for wildfire detection through DL. Our methods include data acquisition from authoritative sources, data processing, and an initial dataset analysis using three pre-trained Convolutional Neural Network (CNN) architectures. Our results show that the EF EfficientNet-B0 model achieves the highest accuracy of over 92% in detecting forest wildfires. The CWGID and the methodology used to build it prove to be a valuable resource for training and testing DL architectures for forest wildfire detection.

architecture, dataset, wildfire, (16 more...)

2409.1638

Country:

North America > United States > California (0.25)
North America > United States > Florida > Escambia County > Pensacola (0.04)
Asia > Middle East > Republic of Türkiye (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry: Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

#artificialintelligence Mar-19-2023, 07:15:19 GMT

Giving YOLOv8 a Second Look (Part 1)

Welcome to the first part in our three-part series on YOLOv8! In this series, we'll show you how to work with YOLOv8, from downloading the off-the-shelf models, to fine-tuning these models for specific use cases, and everything in between. Throughout the series, we will be using two libraries: FiftyOne, the open source computer vision toolkit, and Ultralytics, the library that will give us access to YOLOv8. In Part 1, you'll learn how to generate, load, and visualize YOLOv8 predictions. In Part 2, we'll show you how to evaluate the quality of YOLOv8 model predictions.

detection, prediction, yolov8, (17 more...)

Technology: Information Technology > Artificial Intelligence > Vision (0.73)

#artificialintelligence Dec-10-2021, 21:45:17 GMT

Road signs "driving" you crazy?

You also might be wondering what all this CNN stuff is. Don't worry, I can explain. A CNN is a type of neural network (read my other articles to learn the basics) that is particularly good at image classification. CNNs are used for computer vision because they are great at detecting patterns in images, such as lines, circles and other shapes. A CNN uses convolutional layers, which essentially learn filters that can detect patterns in an image. For example, a filter could detect vertical lines, or it could detect horizontal lines. These filters "convolve" over an image, moving across it in little 3x3 chunks (or whatever the size of the filter is) and taking the dot product of each chunk with the filter.
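To make the sliding dot product concrete, here is a small sketch in plain Python. The vertical-line filter is written by hand for illustration, whereas a real CNN learns its filter values during training. Like most deep learning libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation. The function and variable names are illustrative only.

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image (no padding, stride 1) and take dot products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            # Dot product of the kernel with the chunk whose top-left corner is (r, c).
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# 5x5 image with a bright vertical line in the middle column.
image = [[0, 0, 1, 0, 0] for _ in range(5)]

# Hand-written 3x3 filter that responds strongly to vertical lines.
vertical_kernel = [
    [-1, 2, -1],
    [-1, 2, -1],
    [-1, 2, -1],
]

feature_map = convolve2d(image, vertical_kernel)
for row in feature_map:
    print(row)  # each row prints [-3.0, 6.0, -3.0]
```

The response peaks at 6.0 where the filter is centered on the line and drops to -3.0 on either side, which is how a filter "detects" a pattern at a particular location.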

accuracy, neural network, validation, (16 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

#artificialintelligence May-6-2021, 09:00:04 GMT

A Gentle Introduction to Audio Classification With Tensorflow

We have seen a lot of recent advances in deep learning for vision and language. It is intuitive why CNNs perform well on images, given the local correlation between pixels, and why sequential models like RNNs or transformers perform well on language, given its sequential nature. But what about audio? In this article you will learn how to approach a simple audio classification problem, some of the common and efficient methods used, and the TensorFlow code to do it. Disclaimer: The code presented here is based on my work developed for the "Rainforest Connection Species Audio Detection" Kaggle competition, but for demonstration purposes, I will use the "Speech Commands" dataset. We usually have audio files in the ".wav" format, commonly referred to as waveforms. A waveform is a time series with the signal amplitude at each specific time. If we visualize one of those waveform samples, we get a plot of amplitude over time. Intuitively, one might consider modeling this data like a regular time series (e.g. stock price forecasting) using some kind of RNN model. In fact, this could be done, but since we are using audio signals, a more appropriate choice is to transform the waveform samples into spectrograms. A spectrogram is an image representation of the waveform signal that shows how the intensity of its frequencies changes over time, which is very useful when we want to evaluate the signal's frequency distribution over time.
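The waveform-to-spectrogram step can be sketched without any audio library. The example below is not the article's TensorFlow code: it is a simplified short-time Fourier transform in plain Python that splits a synthetic signal into overlapping frames and takes the DFT magnitude of each. The frame size, hop length, and test tone are assumptions chosen for illustration, and real pipelines typically also apply a window function (such as Hann) and a mel or log scaling.

```python
import cmath
import math
from typing import List


def dft_magnitudes(frame: List[float]) -> List[float]:
    """Magnitudes of the first N//2 + 1 DFT bins of one frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def spectrogram(signal: List[float], frame_size: int, hop: int) -> List[List[float]]:
    """Split the signal into overlapping frames and take the DFT magnitude of each."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frames.append(dft_magnitudes(signal[start:start + frame_size]))
    return frames


# Synthetic waveform: a sine wave that completes 4 cycles per 32-sample frame.
sample_count = 128
signal = [math.sin(2 * math.pi * 4 * t / 32) for t in range(sample_count)]

# Rows are time frames, columns are frequency bins.
spec = spectrogram(signal, frame_size=32, hop=16)

# Index of the strongest frequency bin in each time frame.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
print(len(spec), len(spec[0]), peak_bins)  # 7 17 [4, 4, 4, 4, 4, 4, 4]
```

Each row of `spec` is one column of the spectrogram image, so a steady tone shows up as a constant peak bin across frames, while a changing sound would move the peak over time.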

audio classification, gentle introduction, spectrogram, (15 more...)

Genre: Instructional Material > Course Syllabus & Notes (0.77)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)

#artificialintelligence Sep-19-2020, 14:06:01 GMT

Machine Learning Pipelines with Kubeflow

A lot of attention is being given now to the idea of Machine Learning Pipelines, which are meant to automate and orchestrate the various steps involved in training a machine learning model; however, it's not always made clear what the benefits are of modeling machine learning workflows as automated pipelines. When tasked with training a new ML model, most Data Scientists and ML Engineers will probably start by developing some new Python scripts or interactive notebooks that perform the data extraction and preprocessing necessary to construct a clean set of data on which to train the model. Then, they might create several additional scripts or notebooks to try out different types of models or different machine learning frameworks. And finally, they'll gather and explore metrics to evaluate how each model performed on a test dataset, and then determine which model to deploy to production. This is obviously an over-simplification of a true machine learning workflow, but the key point is that this general approach requires a lot of manual involvement, and is not reusable or easily repeatable by anyone but the engineer(s) that initially developed it.

artificial intelligence, machine learning, pipeline, (16 more...)

Genre: Workflow (0.72)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)